Documents formatted in a portable document format (i.e., PDF documents) are commonly used to simplify the display and printing of structured documents. Such documents permit incorporation of a mix of text and graphics to provide a visually pleasing and easy-to-read document across heterogeneous computing environments. It is estimated that there are currently about 2.5 trillion files on the World Wide Web encoded as PDF documents.
It is often necessary to extract text from a document encoded in a portable document format. For example, text may be extracted from a document to (1) provide narration for the document via synthesized speech, (2) reflow the document for viewing on the small screen of a mobile device, (3) facilitate reading accessibility for visually impaired and motion-impaired users, (4) copy text from the document for pasting into another document, (5) analyze the document text, (6) search the document for phrases, (7) operate on text, (8) summarize the document, or (9) export the document to another format. Current tools can identify contiguous portions of text but do not accurately identify discontinuous portions of text, such as text that is arranged in multiple columns or interspersed around images or other visual elements. Accordingly, once the text is extracted, the extracted text segments must be ordered and/or re-ordered so that they are presented in a proper and logical reading order.
To identify reading order within text, some existing technologies generate documents with tags to indicate portions of text, but many existing documents are not tagged, and tagging tools cannot always correctly tag existing documents. Other technologies employed to identify reading order include segmenting a document and labeling the segments, such as “title” and “body,” using spatial information within a document to determine document structure, and applying topological sorting. Unfortunately, none of these approaches is sufficiently flexible for the large number of existing documents encoded in a portable document format.